fix(operator): un-gate readiness on needs-reindex-all, REINDEX DATABASE CONCURRENTLY#69
Merged
Merged
Conversation
…SE CONCURRENTLY The amcheck-driven smart pass introduced in #68 hits the same family of postgres-internal pathology that wedges other vanilla DDL on the prod dataset — bt_index_check itself burns 100% CPU forever on specific indexes (observed via pg_stat_activity: state=active, empty wait_event, query_start growing linearly with wall clock for 4+ minutes on a tiny system catalog index). The smart pass is unusable on this data. Two changes: 1. Drop bt_index_check; revert the needs-reindex-all branch to a blind REINDEX DATABASE per user DB, CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild and the atomic swap makes corruption disappear without downtime. REINDEX reads the heap and rebuilds the index from scratch — a different code path from amcheck (which reads corrupt index pages directly) — so it isn't subject to the same wedge. Slow on prod-sized DBs (hours) but makes progress. 2. Drop the readiness probe's gate on /pgdata/needs-reindex-all. With the reindex taking hours, gating readiness here trips the operator's 30-minute deployment_ready_timeout and the restore is marked Failed before postgres even has a chance to come up. The probe still gates on /pgdata/needs-reindex (locale-only, finishes in seconds) since that's small and proven. Trade-off: clients hitting a not-yet-reindexed corrupt index in the window between pod-Ready and reindex-complete get the explicit "unexpected zero page" error from postgres. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap) and clients can retry. Strictly better than the alternative (permanently failed restore, replica stuck indefinitely on the previous Active). The pre-#68 behaviour for the locale-only path is unchanged.
1f898f7 to
6987f64
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖
Summary
Two changes that together unblock nauru-prod (and any replica whose snapshot triggers
pg_resetwal):needs-reindex-all; replace it with a blindREINDEX DATABASEper user DB — CONCURRENTLY on PG ≥ 12 so clients can keep using the old indexes during the rebuild./pgdata/needs-reindex-all. Keep the gate on the locale-only/pgdata/needs-reindex(small, fast, proven).Why
Verified live on nauru-prod after #68 deployed. Every restore attempt failed at the operator's 30-minute
deployment_ready_timeout. Tracing inside a fresh pod:/pgdata/needs-reindex-allflag present (pg_resetwal did fire — expected on this source).pg_stat_activityshows the background reindex hook running.bt_index_check('pg_type_oid_index'::regclass)(a tiny system catalog index, normally a few KB) — stateactive, emptywait_event, query_start growing linearly with wall clock. 4+ minutes on that one index with no visible progress.Same shape as the
CREATE TABLEspinloop documented earlier on this dataset: vanilla DDL pegs 100% CPU forever inside postgres with no visible reason. The smart pass via amcheck can't avoid the hammer — it IS a hammer that triggers the same wedge.REINDEX DATABASE CONCURRENTLYreads the heap (not the index) to rebuild from scratch — a completely different code path that isn't subject to whatever amcheck is hitting in the corrupt index pages. PG ≥ 12 syntax; falls back to blockingREINDEX DATABASEon older.But REINDEX DATABASE on the prod-sized DBs takes hours. Gating readiness on it makes
deployment_ready_timeout(30 min default) trip every time and the restore fails. So the second change is necessary: let the pod become Ready as soon as postgres accepts connections; let the reindex run in background. CONCURRENTLY means the corrupt indexes stay queryable until the atomic swap, narrowing the window where clients can see corruption.Caveats
REINDEX DATABASE CONCURRENTLYskips system catalogs (PG won't CONCURRENTLY them). For an analytics replica that's the right trade — user-data indexes are what client queries hit; system catalog corruption surfaces as different errors and is rare.Trade-off
Clients hitting a not-yet-reindexed corrupt index get the explicit
unexpected zero page at block Nerror from postgres — same error the downstream users originally reported. With CONCURRENTLY the window is narrow (queries hit the old index until the atomic swap); they retry, succeed once the rebuild lands.Strictly better than the alternative (permanently-failed restore, replica stuck indefinitely on the previous Active).
Shape
builders.rs: the needs-reindex-all branch in the postgres container startup hook replaces the amcheck loop withREINDEX DATABASE CONCURRENTLY "$db"per user database (with a PG < 12 fallback). Comment explains the wedge and why amcheck is wrong here.builders.rs: readiness probe drops theneeds-reindex-allcheck. Comment explains.REINDEX DATABASE CONCURRENTLY, gates onPG_MAJOR ≥ 12, nobt_index_check; readiness probe gates only on the locale-only flag.Recovery
For replicas already in the failure loop (pre-fix gate), the readiness probe will start returning success on its next check after the operator rolls the new image (the
needs-reindex-allfile may still be there, but the probe no longer cares). The background reindex keeps churning. For the current live nauru-prod pod, I removed the flag by hand to unblock immediately — the next restore will exercise the new code path.